Introduction

We focus on Twitter data because Twitter is a highly popular social media platform that gives insight into individuals' mindsets through short text posts called tweets. Its users include influential public figures, such as senators, as well as anyone else who chooses to use the platform. This gives us data on social media usage in a standardized format, which allows us to draw conclusions about a wide variety of individuals. Twitter is also a good choice because it is primarily text based, so we can analyze most of what is posted rather than missing a large share of the context, as we would on a picture-focused platform such as Instagram.

Summary Information

The senators dataset has 10 columns. The average number of replies across all of its tweets is 41.9017584. We also looked at the maximum and minimum number of favorites: the most-favorited tweet received 2108865 favorites, while the lowest count is 0. The largest number of retweets a tweet received is 3644423, and the smallest is also 0. Finally, we checked which party posted the most tweets in this period: D, which represents the Democratic Party.

The Russian trolls dataset has 21 columns. Here we focused more on the users. On average, these users have 2256.3982189 followers and follow 2008.342079 accounts. The most-followed user has 23890 followers, while the least-followed user has none (0). The most active user, the one who posted the most tweets, is AMELIEBALDWIN.

Finally, the sentiment dataset has 6 columns: target, id, date, flag, user, and text. It contains 100002 tweets that indicate a positive mood and 299998 that indicate a negative mood. We also looked at the most and least popular days of the week for posting in general: the most popular day is Tue, and the least popular is Sun.

Summary Table

Number of Hashtags    Average Number of Retweets
0                     373.56405
1                     113.42351
2                     72.83565
3                     58.58104
4                     57.03279
5                     57.13406
6                     55.04412
7                     15.50000
8                     499.00000
9                     1.00000

The table shows a generally negative relationship between the number of hashtags in a tweet and its average number of retweets; the exception is the value for eight hashtags, which can be treated as an outlier. We include this table to give an intuitive sense of how the average number of retweets varies with the number of hashtags.
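A summary of this kind can be produced by grouping tweets by hashtag count and averaging the retweets within each group. The following is only a minimal sketch of that idea: the function name and the sample pairs are hypothetical and are not taken from the senators dataset.

```python
from collections import defaultdict


def mean_retweets_by_hashtag_count(tweets):
    """Return {hashtag_count: mean retweets} for (hashtag_count, retweets) pairs."""
    totals = defaultdict(lambda: [0, 0])  # hashtag_count -> [retweet sum, tweet count]
    for hashtags, retweets in tweets:
        totals[hashtags][0] += retweets
        totals[hashtags][1] += 1
    return {count: s / n for count, (s, n) in sorted(totals.items())}


# Hypothetical (hashtag_count, retweets) pairs, for illustration only.
sample = [(0, 400), (0, 300), (1, 120), (1, 100), (2, 70)]
averages = mean_retweets_by_hashtag_count(sample)
print(averages)
```

Each group's mean depends on how many tweets fall into it, so groups with very few tweets deserve extra caution.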

Region                  Number of Tweets
United States           159369
United Arab Emirates    9598
Italy                   6278
Azerbaijan              6238
Russian Federation      1295
Ukraine                 1232
Israel                  675
Germany                 656
United Kingdom          600
France                  22
Iraq                    17
Turkey                  9
Japan                   2
Serbia                  2
Egypt                   1

This table lists each region's total number of tweets in descending order. The U.S. has the largest number of qualifying tweets in this period, and Egypt has the smallest. We include this table because it presents the tweet counts for the different countries very clearly.
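To put the counts in proportion, each region's share of the total can be worked out from the table. The sketch below copies the counts from the table above; the share calculation itself is an added illustration and was not part of the original analysis.

```python
# Tweet counts per region, copied from the table above.
region_counts = {
    "United States": 159369, "United Arab Emirates": 9598, "Italy": 6278,
    "Azerbaijan": 6238, "Russian Federation": 1295, "Ukraine": 1232,
    "Israel": 675, "Germany": 656, "United Kingdom": 600, "France": 22,
    "Iraq": 17, "Turkey": 9, "Japan": 2, "Serbia": 2, "Egypt": 1,
}

total = sum(region_counts.values())

# Sort regions by count, largest first, as in the table.
ranked = sorted(region_counts.items(), key=lambda item: item[1], reverse=True)

for region, count in ranked[:3]:
    print(f"{region}: {count} tweets ({count / total:.1%} of all listed tweets)")
```

Shares make it easier to see how strongly the listed tweets are concentrated in a single region.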

Is the Mood Positive?    Number of Tweets
FALSE                    299998
TRUE                     100002

The table for this dataset was somewhat harder to compute. However, it directly shows the total number of tweets in a positive mood and the total number in a negative mood. It is meaningful because it suggests something about tweets, and perhaps social media in general: in this dataset, people posted negative content about three times as often as positive content.
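The counts in this table come from labeling each tweet as positive or not and then tallying the labels. The sketch below shows one way this could be done. It assumes that a target value of 4 marks a positive tweet, which is a guess about the encoding, since the report does not state how the target column is coded. The sample values are hypothetical; only the two reported totals come from the table.

```python
from collections import Counter

POSITIVE_TARGET = 4  # assumed encoding; not confirmed by the report


def count_moods(targets):
    """Tally how many target values are positive (True) and not positive (False)."""
    counts = Counter(value == POSITIVE_TARGET for value in targets)
    return {True: counts[True], False: counts[False]}


# Hypothetical target values, for illustration only.
sample_targets = [0, 0, 4, 0]
mood_counts = count_moods(sample_targets)

# Totals reported in the table above.
reported = {False: 299998, True: 100002}
negative_share = reported[False] / sum(reported.values())
print(mood_counts, f"negative share: {negative_share:.1%}")
```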

Does a Tweet’s Hashtags Affect its Retweets?

This bar chart is intended to show the relationship between the number of hashtags a tweet contains (discrete) and the number of retweets it gets (discrete), averaged over all tweets with the same number of hashtags.

As the chart shows, the overall trend (excluding the outlier at eight hashtags) is a marked decline in the average number of retweets per tweet with each additional hashtag used in a U.S. senator's post. The eight-hashtag tweets do not follow this trend; however, there were only six tweets with eight hashtags, and one of them happened to have thousands of retweets, which heavily skewed the average. Another interesting observation is that tweets with no hashtags received a much higher average number of retweets, and that tweets with even one hashtag averaged far fewer. This is interesting because hashtags are intended to increase visibility and promote awareness; in this small sample, however, we see the opposite pattern of decreased exposure, which could be attributed to a variety of outside factors.
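The effect of a single heavily retweeted tweet on a six-tweet average can be illustrated by comparing the mean with the median. The retweet counts below are hypothetical. They were chosen only so that six values average 499, as in the summary table, and they are not the actual tweets from the dataset.

```python
import statistics

# Hypothetical retweet counts for six tweets: five ordinary tweets and one
# with thousands of retweets. These are NOT the real values from the dataset.
retweets = [12, 8, 20, 5, 15, 2934]

mean_retweets = statistics.mean(retweets)      # pulled upward by the one large value
median_retweets = statistics.median(retweets)  # barely affected by it

print(f"mean = {mean_retweets}, median = {median_retweets}")
```

With so few tweets in the group, the median may describe a typical eight-hashtag tweet better than the mean does.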

How time affects mood

##       0%      25%      50%      75%     100% 
##  0.00000  6.40000 11.90000 18.81667 23.98333
##        0%       25%       50%       75%      100% 
##  0.000000  4.483333 10.283333 18.333333 23.983333

Note: Of the two quantile summaries printed above, the first is for the negative mood tweets and the second is for the positive mood tweets.

A box plot is well suited to showing how time of day relates to mood because it displays the median hour, the quartiles, and the minimum and maximum values, giving readers a quick statistical summary. Based on the box plot and the quantile summaries, the median for negative mood tweets is at hour 11.9 (11:54) and the median for positive mood tweets is at hour 10.2833333 (10:17). The negative mood distribution sits later in the day, which could indicate that people tend to be in a worse mood after a long work or school day. The opposite would be expected for positive mood, as people are fresh out of bed.
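The clock times in parentheses come from converting decimal hours into hours and minutes. A minimal sketch of that conversion follows; the helper name is hypothetical.

```python
def decimal_hour_to_clock(hour):
    """Convert a decimal hour such as 11.9 to an 'HH:MM' string.

    Rounds to the nearest minute, so values within half a minute of
    midnight would be shown as 24:00.
    """
    total_minutes = round(hour * 60)
    return f"{total_minutes // 60:02d}:{total_minutes % 60:02d}"


# Medians reported for negative and positive mood tweets.
print(decimal_hour_to_clock(11.9))        # 0.9 h * 60 = 54 min
print(decimal_hour_to_clock(10.2833333))  # 0.2833333 h * 60 is about 17 min
```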

Most popular weekday to post to Twitter based on the users’ region

A pie chart suits this data because it shows each day's share of the total number of tweets created over the whole week. The tweets are distributed fairly evenly across the days, with slight increases on the weekend and midweek. This could be explained by how easily people can access Twitter through personal devices such as smartphones and computers.
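The slices of such a pie chart are each weekday's share of all tweets. The sketch below shows one way to compute those shares. It assumes the posting dates have already been parsed into date objects, and the sample dates are hypothetical.

```python
from collections import Counter
from datetime import date

DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def weekday_shares(dates):
    """Return {weekday name: fraction of tweets posted on that weekday}."""
    counts = Counter(DAY_NAMES[d.weekday()] for d in dates)
    total = sum(counts.values())
    return {day: n / total for day, n in counts.items()}


# Hypothetical posting dates, for illustration only (2017-01-01 was a Sunday).
sample_dates = [date(2017, 1, 1), date(2017, 1, 3), date(2017, 1, 3), date(2017, 1, 4)]
shares = weekday_shares(sample_dates)
print(shares)
```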